
Implement abortIsolate() in cloudflare:workers module #6237

Open
hoodmane wants to merge 2 commits into main from hoodmane/reload-worker

Conversation

@hoodmane
Contributor

@hoodmane hoodmane commented Mar 4, 2026

Add abortIsolate(reason) API that terminates the current JS isolate and
creates a fresh one from scratch, resetting all module-level state. When
called, it immediately terminates all in-flight requests on the isolate.
The next request creates a new Worker.

The abort-all mechanism uses a shared ForkedPromise that each IoContext
subscribes to via onLimitsExceeded(). When abortIsolate() is called, the
promise is rejected, causing every IoContext on the isolate to abort. This
mirrors how the production 2x memory limit kill works. The reason string is
included in all error messages across all aborted requests.

@hoodmane hoodmane requested review from a team as code owners March 4, 2026 16:45
@ask-bonk
Contributor

ask-bonk bot commented Mar 4, 2026

ResolveMessage: Cannot find module '@opencode-ai/plugin' from '/home/runner/work/workerd/workerd/.opencode/tools/bazel-deps.ts'

github run

@ask-bonk
Contributor

ask-bonk bot commented Mar 4, 2026

@hoodmane Bonk workflow failed. Check the logs for details.

View workflow run · To retry, trigger Bonk again.

@hoodmane hoodmane force-pushed the hoodmane/reload-worker branch from ba459c8 to 1f9a0ea on March 4, 2026 at 16:47
@ask-bonk
Contributor

ask-bonk bot commented Mar 4, 2026

ResolveMessage: Cannot find module '@opencode-ai/plugin' from '/home/runner/work/workerd/workerd/.opencode/tools/bazel-deps.ts'

github run

@hoodmane hoodmane force-pushed the hoodmane/reload-worker branch 3 times, most recently from f11cf40 to 2adf211 on March 4, 2026 at 16:54
Factor out the Isolate -> Script -> Worker creation pipeline from
makeWorkerImpl() into a new Server::createWorker() method. This
separates the worker creation logic (inspector policy, module registry,
fallback service, artifact bundler, script compilation, worker
construction) from the validation checks and post-creation wiring that
are specific to makeWorkerImpl().

The experimental feature flag checks (NMR, module fallback, Python/NMR
incompatibility) remain in makeWorkerImpl() since they only need to run
once at initial config validation time.

No behavioral changes.
@hoodmane hoodmane force-pushed the hoodmane/reload-worker branch 3 times, most recently from 8769c83 to abcc698 on March 4, 2026 at 17:11
@kentonv
Member

kentonv commented Mar 4, 2026

We need to be a bit careful with this as overuse could lead to excessive isolate creation, especially if requests are long-running (like WebSockets).

For your use case would it be OK if we actually terminated the worker, erroring in-flight requests? I would be more comfortable with that as it can't cause a build-up of condemned isolates so easily.

If that works for you, I would call it abortIsolate(), consistent with the existing ctx.abort(). It should probably take a "reason" as a parameter, which will be thrown from all the in-flight requests.

@hoodmane
Contributor Author

hoodmane commented Mar 5, 2026

overuse could lead to excessive isolate creation

My thought was that this should behave in exactly the same way as when the worker allocates 2x the memory limit and gets condemned. That way, it doesn't offer any additional surface area to worry about.

For your use case would it be OK if we actually terminated the worker, erroring in-flight requests?

Is that what allocating too much memory does? The current situation is that every subsequent request fails until the worker is retired, so whatever we do will be an improvement over that -- better for just the in-flight requests to error than for however many future requests to fail.

@jasnell
Collaborator

jasnell commented Mar 5, 2026

The way the memory limit works is that if the worker hits 1x the memory limit it is condemned, which allows it to complete the current work but tears down the isolate once that work is complete. New requests go to a fresh, non-condemned isolate. If, while cleaning up, the condemned worker goes on to hit 2x the memory limit, it gets immediately destroyed. @kentonv's concern is that if we just have this condemn the isolate, we might in the worst case have a ton of condemned isolates, each just finishing processing of a single request, while constantly spinning up new ones to replace them, which causes a large amount of churn.

Instead, this should work exactly like hitting the 2x limit. All activity being processed by the current isolate is interrupted immediately and the isolate is destroyed.

@hoodmane
Contributor Author

hoodmane commented Mar 6, 2026

In the Python use case the isolate won't be useful for anything anyways so it's definitely better to immediately destroy it and return errors.

Add abortIsolate(reason) API that terminates the current JS isolate and
creates a fresh one from scratch, resetting all module-level state. When
called, it immediately terminates all in-flight requests on the isolate.
The next request creates a new Worker.

The abort-all mechanism uses a shared ForkedPromise that each IoContext
subscribes to via onLimitsExceeded(). When abortIsolate() is called, the
promise is rejected, causing every IoContext on the isolate to abort. This
mirrors how the production 2x memory limit kill works. The reason string is
included in all error messages across all aborted requests.
@hoodmane hoodmane force-pushed the hoodmane/reload-worker branch from abcc698 to 3f64ddb on March 6, 2026 at 13:32
@hoodmane
Contributor Author

hoodmane commented Mar 6, 2026

Okay updated the PR to kill all in-flight requests.

@hoodmane hoodmane changed the title from Implement resetWorker() in cloudflare:workers module to Implement abortIsolate() in cloudflare:workers module on Mar 6, 2026
@hoodmane
Contributor Author

hoodmane commented Mar 9, 2026

@jasnell @kentonv would appreciate another look at this

@kentonv
Member

kentonv commented Mar 9, 2026

This seems to have a huge amount of churn in server.c++ which I am not excited about. I guess it's because previously Workers were constructed entirely at startup time (except for dynamic workers) and so being able to recreate them requires some refactoring. Still, I would want to spend some time looking for an easier way before accepting this.

Was this written by you or by AI?

server.c++ is specific to workerd, so that churn doesn't even benefit production. How much do we need this function to be supported in workerd, vs. just in prod?

Either way you need a separate change in the internal codebase for this to be supported in production. (It should actually be a lot easier there, though, since runtime loading and reloading of Workers is the norm there.)

@danlapid
Collaborator

danlapid commented Mar 9, 2026

This seems to have a huge amount of churn in server.c++ which I am not excited about. I guess it's because previously Workers were constructed entirely at startup time (except for dynamic workers) and so being able to recreate them requires some refactoring. Still, I would want to spend some time looking for an easier way before accepting this.

Was this written by you or by AI?

server.c++ is specific to workerd, so that churn doesn't even benefit production. How much do we need this function to be supported in workerd, vs. just in prod?

Either way you need a separate change in the internal codebase for this to be supported in production. (It should actually be a lot easier there, though, since runtime loading and reloading of Workers is the norm there.)

As to the latter part of this comment - I asked for local dev support here.
We know this is much simpler in prod.
The reason I asked for local dev support is that we want these features to work reliably and be testable in workerd, so that we can benefit from them in our test suites in workers-rs and workers-py.
Moreover, we want a reliable way to recover from these errors in local dev as well, ideally one that matches production for when users see them in their testing suites.

export class ServiceStub {}

export function waitUntil(promise: Promise<unknown>): void;
export function abortIsolate(reason?: string): never;
Member

I believe the return value is void?

Suggested change
export function abortIsolate(reason?: string): never;
export function abortIsolate(reason?: string): void;

Member

This function also needs to be added into the types folder (and later we need just generate-types)

Member

never is correct since the function does not return.

containerEgressInterceptorImage(kj::mv(containerEgressInterceptorImageParam)),
isDynamic(isDynamic) {}
isDynamic(isDynamic) {
resetAbortAllPromise();
Member

Would you mind adding a comment above this line?

Comment on lines +3469 to +3471
if (workerFactory == kj::none) {
JSG_FAIL_REQUIRE(Error, "abortIsolate() is not supported for this worker configuration.");
}
Member

Nit:

Suggested change
if (workerFactory == kj::none) {
JSG_FAIL_REQUIRE(Error, "abortIsolate() is not supported for this worker configuration.");
}
JSG_REQUIRE(workerFactory != kj::none, Error, "abortIsolate() is not supported for this worker configuration.");

@kentonv
Member

kentonv commented Mar 9, 2026

OK, if we need local dev... I kind of wonder whether local dev could work by just aborting the process?

(Unless it's a dynamic isolate, which should be easy to throw away so that it gets recreated next time.)


// Immediately terminate V8 execution so no further JS runs in this request.
// This raises an uncatchable exception in V8, causing the request to fail immediately.
jsg::Lock::from(v8::Isolate::GetCurrent()).terminateExecutionNow();
Member

Don't use jsg::Lock::from(). Have the first argument to the method be jsg::Lock& js -- it'll be filled in by the JSG layer.

@kentonv
Member

kentonv commented Mar 12, 2026

Can we convince Bonk to review this?

kj::Maybe<kj::String> moduleFallbackForFactory;
if (!isDynamicForFactory && isModulesSourceForFactory) {
auto msg = kj::heap<capnp::MallocMessageBuilder>();
msg->setRoot(def.featureFlags);
Member

Copy seems unnecessary: Since it's not a dynamic worker, the def should be long-lived.

auto msg = kj::heap<capnp::MallocMessageBuilder>();
msg->setRoot(def.featureFlags);
featureFlagsMsgForFactory = kj::mv(msg);
ownedSourceForFactory = kj::heap<OwnedWorkerSource>(def.source);
Member

This is copying the whole source code of the worker! Just to support this feature. It's probably unnecessary (see previous comment).

@kentonv
Member

kentonv commented Mar 12, 2026

The changes to server.c++ are too deep to be trusted to AI I think. Just spot checking it's pretty "sloppy" (see above comments).

This needs more than just a review loop where we all post comments; I think it needs a human engineer to go in and work on the code directly.

@anonrig
Member

anonrig commented Mar 12, 2026

/bigbonk roast this pr

Contributor

@ask-bonk ask-bonk bot left a comment

Review findings

This is an AI-generated review and may contain inaccuracies. The findings below focus on correctness and safety issues in the implementation.

Summary

The overall design (abort-all via ForkedPromise, lazy recreation in startRequest()) is sound and mirrors the existing 2x-memory-limit kill path well. However, there are several correctness issues in the OwnedWorkerSource deep-copy, unresolved prior review comments, stale comments, and missing steps in the worker factory that need attention. See inline comments.

kentonv's concern about the amount of churn in server.c++ is well-founded -- the OwnedWorkerSource class and the deep-copy machinery add significant complexity for what is (by the PR's own admission) a workerd-only code path. The suggestion to explore simpler approaches (e.g., process restart for non-dynamic workers) deserves serious consideration before investing in fixing all the issues below.


// Immediately terminate V8 execution so no further JS runs in this request.
// This raises an uncatchable exception in V8, causing the request to fail immediately.
jsg::Lock::from(v8::Isolate::GetCurrent()).terminateExecutionNow();
Contributor

[HIGH] Previously flagged issue not addressed: kentonv commented that jsg::Lock::from() should not be used — instead, the method should take jsg::Lock& js as its first parameter, which JSG fills in automatically.

The current code uses jsg::Lock::from(v8::Isolate::GetCurrent()) which is the discouraged pattern. The fix is straightforward:

jsg::Lock::from(v8::Isolate::GetCurrent()).terminateExecutionNow();

should become:

  js.terminateExecutionNow();

and the method signature (in the .h) should change to:

void abortIsolate(jsg::Lock& js, jsg::Optional<kj::String> reason);

JSG will automatically provide the jsg::Lock& argument — it won't be visible to JavaScript callers.

Collaborator

Sigh.. very helpful bonk



// Throws away the current JS isolate and recreates the worker from scratch. The current
// request continues running on the old isolate; subsequent requests will use a fresh
// isolate with re-executed top-level module code and fresh global state.
Contributor

[MEDIUM] Stale comment. This says "The current request continues running on the old isolate" but the actual implementation (per jasnell/kentonv's feedback) aborts ALL in-flight requests immediately. The comment should read something like:

Throws away the current JS isolate and recreates the worker from scratch. All in-flight requests on the isolate are immediately terminated. Subsequent requests will use a fresh isolate with re-executed top-level module code and fresh global state.

The same stale wording appears in io-channels.h:279-281.

Comment on lines +4518 to +4525
explicit OwnedWorkerSource(const WorkerSource::ModulesSource& ms)
: ownedMainModule(kj::str(ms.mainModule)),
isPython(ms.isPython),
ownedModuleData(cloneAll(ms.modules)),
source(WorkerSource::ModulesSource{
.mainModule = ownedMainModule,
.modules = buildAll(ownedModuleData),
.isPython = isPython,
Contributor

[HIGH] OwnedWorkerSource drops several fields from ModulesSource:

  1. capnpSchemas (capnp::List<capnp::schema::Node>::Reader) — not copied. This is a non-owning capnp reader that points into the original config message. While Cap'n Proto modules are marked "TODO(someday)" for workerd, if someone does use a CapnpModule, this will be a dangling reader. At minimum, the field should be initialized to an empty list or an orphan copy.

  2. pythonMemorySnapshot (kj::Maybe<capnp::AnyStruct::Reader>) — not copied. If a Python worker uses abortIsolate(), the recreated worker will be missing its memory snapshot.

  3. CommonJsModule::namedExports (kj::Maybe<kj::Array<kj::StringPtr>>) — the CJS clone at line 4541 only copies .body but drops namedExports. These are non-owning StringPtrs into the body, so they'd need to be reconstructed after cloning the body.

  4. EsModule::ownBody (kj::Maybe<::rust::String>) — line 4535 copies from es.body (the ArrayPtr<const char>) which works if ownBody is kj::none (body points into capnp message). But if ownBody is set (meaning body was transpiled from Rust and points into ownBody), the copy via kj::str(es.body) will correctly capture the content. However, the original ownBody is silently dropped, which is fine since the clone owns the string. Just noting for completeness — this case is actually handled correctly.

Items 1-3 are bugs.

Comment on lines +4857 to +4861
if (!isDynamicForFactory && isModulesSourceForFactory) {
auto msg = kj::heap<capnp::MallocMessageBuilder>();
msg->setRoot(def.featureFlags);
featureFlagsMsgForFactory = kj::mv(msg);
ownedSourceForFactory = kj::heap<OwnedWorkerSource>(def.source);
Contributor

[HIGH] Previously flagged issue not addressed: kentonv commented that the deep copy of the worker source code here is unnecessary. For a static (non-dynamic) worker, the capnp config message that contains the source lives for the entire lifetime of the Server. The `WorkerSource` borrows from it, and that borrow remains valid.

Rather than the entire `OwnedWorkerSource` machinery (~100 lines), you could just store a `const WorkerSource&` reference to `def.source` (or a copy of the lightweight `WorkerSource` struct itself, which only contains non-owning pointers). The source data it points to lives in the config message, which outlives the `WorkerService`.

If the concern is that `def` gets moved into `linkCallback`, then extract the `WorkerSource` reference before that move (as is already done for other fields). The underlying data remains valid regardless of where the `WorkerSource` struct lives.

This would eliminate the `OwnedWorkerSource` class entirely and address kentonv's concern about unnecessary copies of the entire source code.

Comment on lines +5075 to +5087
auto newWorker = createWorker(workerName, source, featureFlags.asReader(), extensions,
moduleFallbackPtr, kj::mv(compileBindings));

// Compile ctx.exports bindings.
newWorker->runInLockScope(Worker::Lock::TakeSynchronously(kj::none), [&](Worker::Lock& lock) {
JSG_WITHIN_CONTEXT_SCOPE(lock, lock.getContext(), [&](jsg::Lock& js) {
WorkerdApi::from(newWorker->getIsolate().getApi())
.compileGlobals(lock, ctxExportsGlobals, ctxExportsHandle.getHandle(js), 1);
});
{ auto drop = kj::mv(ctxExportsHandle); }
});

return newWorker;
Contributor

[MEDIUM] The worker factory skips several steps that makeWorkerImpl() performs on the original worker:

  1. preloadPython() (line 4735) — not called in the factory. If this is a Python worker, the recreated worker won't have preloaded Python artifacts.

  2. validateHandlers() (line 4774 in the original path) — not called in the factory. Handler validation errors will be silently skipped on recreation.

  3. Actor namespace initialization — the factory calls createWorker() but doesn't call initActorNamespaces() or set up actor channels. If the worker has Durable Objects, the recreated worker may not have them properly configured. (Though the existing WorkerService already has the actor namespaces, so this might be OK if they don't need re-initialization — worth verifying.)

  4. Inspector registration: createWorker() does register with the inspector (good), but if the old isolate's inspector session is still active, there could be confusion. Worth considering.

Items 1-2 are potential bugs for Python workers and workers with unusual handler configurations.


// Cloned env binding globals for use by abortIsolate() to recreate the worker with the same
// bindings. Only populated for static (non-dynamic) workers.
kj::Array<WorkerdApi::Global> envGlobalsForFactory;
Contributor

[MEDIUM] This clones the env globals eagerly for every worker, even though it's only used when abortIsolate() is called (which most workers never do). This is the `envGlobalsForFactory` field.

Given that for static workers the `def` is moved into `linkCallback` anyway, and the underlying config data is long-lived, consider whether the globals could be cloned lazily (only on first `abortIsolate()` call) or borrowed from the config message. The `Global::clone()` does deep copies of strings and byte arrays for each binding, which is unnecessary work for the common case.

Comment on lines +2102 to +2104
// If abortIsolate() was called, lazily recreate the worker now (outside any isolate lock).
if (isolateAbortRequested) {
worker = KJ_ASSERT_NONNULL(workerFactory)();
Contributor

[MEDIUM] Race window between abortIsolate() and startRequest(). Consider: request A calls abortIsolate(), which sets isolateAbortRequested = true and rejects the abort-all promise. Request B arrives and calls startRequest(), which sees the flag, calls workerFactory() to create a new worker, and resets the flag. Meanwhile, the abort-all rejection is still propagating to other in-flight requests on the old isolate.

The factory call (workerFactory()) is synchronous and happens on the event loop thread, so there's no actual concurrency issue here. But it's worth noting that the factory creates a new V8 isolate synchronously in startRequest(), which blocks the event loop. For workers with large codebases, this could cause noticeable latency on the first request after an abort.

Also: if workerFactory() throws (e.g., compilation error in the worker script), isolateAbortRequested remains true (since the assignment at line 2105 is after the factory call). This means subsequent requests will keep trying to recreate and failing. Consider catching the exception and logging it, or at least resetting the flag regardless.

export class ServiceStub {}

export function waitUntil(promise: Promise<unknown>): void;
export function abortIsolate(reason?: string): never;
Contributor

[LOW] Previously flagged by anonrig: this function also needs to be added to the `types/` folder (with `just generate-types`). kentonv confirmed the `never` return type is correct (since the function does not return).

Comment on lines +3488 to +3491
abortAllFulfiller->reject(JSG_KJ_EXCEPTION(FAILED, Error, kj::mv(message)));

// Create a fresh promise/fulfiller for the next generation of requests (after recreation).
resetAbortAllPromise();
Contributor

[LOW] The abort-all mechanism looks correct: rejecting the fulfiller causes all `onLimitsExceeded()` branches to fire, which triggers `abortWhen()` -> `abort()` on each IoContext. The subsequent `resetAbortAllPromise()` creates a fresh pair for the next generation.

One subtlety: between rejecting the old fulfiller and calling `resetAbortAllPromise()`, if any code path called `onLimitsExceeded()` (e.g., a new IoContext being constructed), it would get a branch of the already-rejected promise and immediately abort. This is actually the desired behavior (we want requests on the old isolate to fail), but worth noting in a comment.

@ask-bonk
Contributor

ask-bonk bot commented Mar 12, 2026

Review posted on PR #6237. Here's a summary of the findings:

HIGH severity

  1. jsg::Lock::from() usage (workers-module.c++:69) -- kentonv flagged this previously and it remains unaddressed. The method should take jsg::Lock& js as its first parameter instead.

  2. OwnedWorkerSource drops fields (server.c++:4518-4525) -- capnpSchemas, pythonMemorySnapshot, and CommonJsModule::namedExports are all silently dropped during the deep copy.

  3. Unnecessary deep copy of source code (server.c++:4857-4861) -- kentonv flagged this: for static workers the capnp config outlives the WorkerService, so the entire OwnedWorkerSource class (~100 lines) could be replaced by a reference/lightweight copy.

MEDIUM severity

  1. Stale comments (workers-module.h:82-84, io-channels.h:279-281) -- Still say "current request continues running" when the implementation aborts all in-flight requests.

  2. Factory skips steps (server.c++:5075-5087) -- Missing preloadPython(), validateHandlers() on recreation.

  3. Eager Global::clone() (server.c++:4073) -- Clones env globals for every worker even though most never call abortIsolate().

  4. Factory error handling (server.c++:2102-2104) -- If workerFactory() throws, isolateAbortRequested stays true, causing every subsequent request to retry and fail.

LOW severity

  1. Missing types/ update -- anonrig flagged this: abortIsolate needs to be added to the types folder.
  2. Abort-all mechanism timing note -- Between reject and reset, new IoContexts would immediately abort (correct behavior but worth documenting).

github run

